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Abstract 

Background: The human genome contains a large amount of c/'s-regulatory DNA elements responsible for 
directing both spatial and temporal gene-expression patterns. Previous studies have shown that based on their 
mRNA expression breast tumors could be divided into five subgroups (Luminal A, Luminal B, Basal, ErbB2 + and 
Normal-like), each with a distinct molecular portrait. Whole genome gene expression analysis of independent sets of 
breast tumors reveals repeatedly the robustness of this classification. Furthermore, breast tumors carrying a TP53 
mutation show a distinct gene expression profile, which is in strong association to the distinct molecular portraits. 
The mRNA expression of 552 genes, which varied considerably among the different tumors, but little between two 
samples of the same tumor, has been shown to be sufficient to separate these tumor subgroups. 

Results: We analyzed in silico the transcriptional regulation of genes defining the subgroups at 3 different levels: 1. 
We studied the pathways in which the genes distinguishing the subgroups of breast cancer may be jointly involved 
including upstream regulators (1 st and 2 nd level of regulation) as well as downstream targets of these genes. 2. 
Then we analyzed the promoter areas of these genes (-500 bp tp +100 bp relative to the transcription start site) for 
canonical transcription binding sites using Genomatix. 3. We looked for the actual expression levels of the identified 
TF and how they correlate with the overrepresentation of their TF binding sites in the separate groups. We report 
that promoter composition of the genes that most strongly predict the patient subgroups is distinct. The class- 
predictive genes showed a clearly different degree of overrepresentation of transcription factor families in their 
promoter sequences. 

Conclusion: The study suggests that transcription factors responsible for the observed expression pattern in breast 
cancers may lead us to important biological pathways. 



Background 

Previous studies have shown that breast tumors can be 
divided into five subgroups (Luminal A, Luminal B, Nor- 
mal-like, ErbB2 over-expressing, and Basal-like) based on 
their mRNA expression patterns [1]. These patterns have 
been validated in independent datasets representing dif- 
ferent laboratories, platforms and different patient 
cohorts [2]. Survival analyses on a sub-cohort of patients 
with locally advanced breast cancer showed a significant 
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difference in outcome of the patients in the various ex- 
pression subgroups, with poor prognosis for the ErbB2 + 
and basal-like subtypes [2]. The expression of 552 genes, 
the intrinsic gene list, has been suggested to be sufficient 
to separate breast carcinomas into the five distinct 
subgroups. What mechanisms of common regulation 
make these genes cluster together? We have previ- 
ously shown that we can separate the patient clusters 
based only on the promoter composition of single 
binding sites in the promoters of the genes from the 
intrinsic gene list [3]. However, regulation of gene ex- 
pression in eukaryotes is highly complex and depends 
on sets of TFs rather than individual TFs [4] and 
in this study we attempt to characterize the overrepre- 
sentation of entire TF families. The promoter 
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composition of the genes is one of the major determi- 
nants of gene regulation including multiple transcrip- 
tion binding sites that interact with a specific 
combination of transcription factors (TF). Eukaryotes 
achieve this diversity by combining a small number of 
transcription factors whose activities are modulated by 
diverse sets of conditions [5]. Different functionalities 
can be conferred on one TF by its association with dif- 
ferent co-factors. These factors may act as global TFs 
that assist their gene-specific partners in their function, 
and may thus activate or repress transcription depend- 
ing on the partner motif and the condition [5]. Analyz- 
ing transcription network dynamics in yeast, Luscombe 
et al. showed that, in response to diverse stimuli, tran- 
scription factors may alter their interaction patterns to 
varying degree, thereby rewiring the network [6]. While 
few transcription factors serve as permanent hubs, most 
of them act transiently during certain conditions. Ex- 
ogenous processes like environmental responses facili- 
tated fast signal transductions to multiple genes with 
short regulatory cascades, whereas endogenous pro- 
cesses needed to progress through multiple stages with 
a complex combination of TFs to fewer target genes [6]. 
The same TFs may act both in endogenous and exogen- 
ous processes. Regulatory hubs targeting disproportion- 
ately large numbers of genes and thereby representing 
the most influential components of a network- have 
been described. Both Pilpel [5] and Luscombe [6] con- 
cluded that precise regulation of a condition cannot 
arise from the specificity of individual TFs, therefore 
combinatorial TF usage seems to be the key. The NF-kB 
family of TFs is an example of transcription regulators 
that are activated by both intra- and extra- cellular stim- 
uli such as cytokines, oxidant-free radicals, ultraviolet ir- 
radiation, and bacterial or viral products [7]. Aberrant 
NF-kB activity has been implicated in carcinogenesis 
and in the control of cellular response to anti-cancer 
agents. Activated NF-kB was detected predominantly in 
ER-negative breast tumors, and mostly in the ErbB2- 
over-expressing tumor subgroup [8]. 

Methods 

The in silico analysis of the transcriptional regulation 
of genes defining the subgroups was performed at 
three different levels: (1) Study of the pathways in 
which the genes distinguishing the subgroups of 
breast cancer may be jointly involved including up- 
stream regulators (1 st and 2 nd level of regulation) as 
well as downstream targets of these genes. (2) Then 
we analyzed the promoter areas of these genes (-500 
bp tp +100 bp relative to the transcription start site) 
for canonical transcription binding sites using Geno- 
matix. (3) We looked for the actual expression levels 
of the identified TF and how they correlate with the 



overrepresentation of their TF binding sites in the 
separate groups. 

Selection of genes 

The expression of 552 genes, the intrinsic gene list, 
which has been suggested to be sufficient to separate 
breast carcinomas into the five distinct subgroups 
defined in [1] and [2,9] was used for the pathway analysis 
in this study (referred to as full list). A subset consisting 
of 197 genes [10] that best represented the classification 
scheme in breast cancer (referred as top list) were 
selected from the intrinsic list, and used in the promoter 
analysis part (Additional file 1: Table SI). 

Pathway analysis 

Pathway analysis was performed using Pathway Studio 
[11] from Ariadne Genetics. Two network prediction 
algorithms were used that allow to discover the patterns 
of gene expression inherent in the experimental data: 
Pearson Correlation and Auto Net Finder network pre- 
diction algorithm. Pathway Studios text mining tools 
were applied to extract biological associations by mining 
PubMed to build pathways from extracted facts using 
data from recent publications and public and commer- 
cial databases such as KEGG, BIND, GO, and the 
PathArt database of curated signaling and disease path- 
ways. The algorithm for building Correlation Network in 
Pathway Studio is based on Pearson Correlation. Genes 
with similar expression profiles are connected with edges 
indicating the significance of the correlation. The group 
of tightly correlated genes form cluster in the correlation 
network. The algorithm can be used for clustering genes 
according to their expression profiles across multiple 
samples. The tool calculates correlation coefficients be- 
tween all pairs of gene expression profiles measured in 
the experiment and outputs clusters of highly correlated 
genes. Identified gene clusters can be further validated 
and analyzed using relations from the database that have 
been extracted from the literature by Ariadne Genetics. 
Auto Net Finder is a network estimation system that 
combines hierarchical clustering and Graphical Gaussian 
Modeling and is used for distinguishing direct and indir- 
ect relationship among variables. Bibliosphere pathways 
(release 7.1) [12] (http://www.genomatix.de, Genomatix 
Software GmbH) was used for extracting the associations 
between gene, transcription factor and proteins corre- 
sponding with the genesets defining each molecular sub- 
type of breast cancer. Genomatix Bibliosphere is a 
knowledge database consisting of manually curated co- 
cited genes in PubMed, which additionally provides in- 
formation about the presence of TFBS in their promo- 
ters, using in silico tool- Matlnspector, interactions and 
associated pathways from Molecular Interactions data- 
base-NetPro and BioCyc, respectively. 
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Analysis of overrepresentation of TFBS families in the 
promoter sequences 

We extracted the putative regulatory promoter regions 
from 500 bp upstream to 100 bp downstream of RefSeq 
promoters of the subtype-associated genes. Further ana- 
lysis was based on the hypothesis that overrepresentation 
of potential transcription factor binding site (TFBS) 
motifs in a set of co-expressed gene promoters may indi- 
cate regulatory relationship. In order to emphasize the 
functional representation of TFBS motifs overrepre- 
sented in a set of promoters, we used the TFBS matrix 
family concept. TFBS matrix families are defined as 
groups of TFBS weight matrices corresponding to the 
same or functionally similar transcription factors. For 
any given TF, there could be multiple matrices described 
by different independent sources, leading to multiple 
matches for similar position or shifting of matches by a 
few base pairs. By using the functional domain clustering 
based on di/tri/tetra-nucleotide occurrence and addition- 
ally function-based subgrouping, TFBS matrices can be 
grouped according to their functional similarity, known 
as TFBS families [13]. Thus members sharing same TFBS 
family are expected to have functional similarity in 
addition to binding domain similarity. For estimation of 
over-representation of each TFBS family, first occur- 
rences of its corresponding TFBS motifs within a set 
of subtype-specific promoter sequences was obtained. 
Then relative occurrence of each TFBS family was esti- 
mated by comparing this observed occurrence to the 
rate of occurrence of the same TFBS matrix family in 
an equal base-pair long reference background sequences 
from human promoter. Overrepresentations of a motif 
is measured by two different methods: 
1. In terms of fold factor of overrepresentation 
compared to the background 
Fold factor of TFBS overrepresentation was 
calculated by a formula as mentioned below: 



same TFBS in equally sized human genomic 
promoter sequence population. 



r(X) 



n exp (X) 



Where, r(X) = fold factor of overrepresentation of a 
TFBS family, X 

n obs PO = observed number of hits of X in a given set 
of promoter sequences 

n exp PC = expected number of hits of X in an equally 
sized sample from genomic promoter background 
sequences 

2. As z-scores that provide a measure of the distance of 
sample from the reference population mean. Here 
sample refers to the number of observed hits of any 
particular TFBS in a given input set of sequences 
and reference refers to the number of hits of the 



z(X) 



n<>bs(X) - n exp (X) -0.5 



z(X) is a z-score of overrepresentation of a transcrip- 
tion factor binding site family (X); 

n 0 bs PO 1S a number of observed hits of X in an input 
promoter sequences; 

n e xp PO is expected number of hits of X in an equally 
sized sample sequences in human genomic promoter 
background; 

S(X) is a population standard deviation of number of 
hits of X 

We used Genomatix RegionMiner tool (Genomatix 
Software GmbH, http://www.genomatix.de) in order 
to evaluate the degree of TFBS family overrepresen- 
tation. The histogram of z-scores of each TFBS motif 
families in each subtype-specific promoter sequences 
is shown in the Additional file 2: Figure SI. Histo- 
grams like this indicate that choosing the cut-off 
level of 2.0 allows identifying TFBS families that are 
overrepresented. However, z-score cut-off level of 2.0 
does not provide a precise measure of significance, 
because of the disparity of sample size between sam- 
ple and reference. Due to the copyright and technical 
limitations in accessing the Transfac database, further 
statistical testing of over-representation could not be 
performed within that tool. 

Under-representations or absence of TFBS family 
motifs in sub-type specific genes may occur due to a 
fewer number of subtype-representative genes and 
subsequently a smaller number of promoter 
sequences used for any particular subtype. This can 
be a source of false positivity. Therefore we have not 
taken into account the under-representations of 
TFBS family motifs in this analysis. 

Principal component analysis to identify TFBS with 
maximum variance between subtypes 

Principal component analysis (PC A) [14] was per- 
formed for ranking the TFBS families with respect to 
the variance of fold-factor overrepresentation con- 
tributed by them between five subtypes. We prepared 
a matrix of TFBS fold-factors for subtypes, with sub- 
types as columns and TFBS families as rows. We 
performed PCA on this matrix using the princomp 
function of Matlab. Subtracting each data point from 
the column mean represents a center of this matrix. 
Hotellings T 2 statistic was used as a measure of 
multivariate distance of each TFBS family from the 
center of the TFBS fold-factor matrix as described in 
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[http://www. mathworks. com/help /toolbox/ stats/prin- 
comp.html]. 

Gene expression data 

We used a subset of the samples (n = 114) from previ- 
ously published [15] mRNA expression data [GEO data- 
set #GSE19783]. Subtypes were predicted by using the 
PAM50 [16]. 

mRNA expression of the studied TF 

Transcription factor families with overrepresentation z- 
score >2.0 were mapped to their corresponding probes 
in the mRNA expressions dataset By applying multiclass 
SAM, we extracted 120 TF genes with significantly dif- 
ferent (at the FDR <0.1) expression between the five 
subtypes. Pearsons correlation between the subtype-spe- 
cific geometric mean expression of this subset of tran- 
scription factor genes and fold overrepresentation was 
computed. The justification of using geometric mean in- 
stead of arithmetic mean is that typically mRNA expres- 
sion values are log-normally distributed. 

Results and discussion 

Pathway analysis of the genes that define the five breast 
cancer subgroups 

Using Pathway Studio from Ariadne Genetics, we studied 
the direct interactions between the genes with distin- 
guished gene expression pattern in the breast cancer 
subgroups as described in Materials and Methods, selec- 
tion of genes. Most profound direct interactions were 
observed for the genes defining the luminal A group 
with protein-protein interactions between XBP1 and 
ESR1 and CCND1 (Additional file 3: Figure S2). Trefoil 
(TFF3) has been functionally coupled to CCND1 through 
angiotensin receptor 1 (AGTR1). Angiotensin II is con- 
verted from its precursor by angiotensin I-converting en- 
zyme (ACE) and has been shown to mediate growth in 
breast cancer cell lines via ligand-induced activity 
through the angiotensin II type 1 receptor (AGTR1). We 
also searched for upstream regulators as well as down- 
stream targets of these genes. Downstream targets could 
be observed centered at the ESR1, MYC, NFKB1, 
GATA3, CCND1, TP53 and MSX2/FOXC1 (Additional 
file 4: Figure S3). 

A somewhat less organized pathway structure is 
observed in the luminal B subclass. The ESR1 node was 
not observable and the TP53 network was more sparse 
with fewer partner genes. Novel nodes were centered at 
NRG1, GSTP1 and CUL1 (Additional file 5: Figure S4), 
CUL1 has homology to yeast Cdc53, which is part of a 
complex known as SCF that mediates the ubiquitin- 
dependent degradation of Gl cycles and cyclin- 
dependent kinase inhibitors, while NRG1 contains a do- 
main related to the epidermal growth factor family of 



ligands and can act as receptor agonists. The direct 
interactions between genes highly expressed in Luminal 
B subtype were observed between GSTP1 and CDK2AP1, 
S100A10 and S100A11 and PPP1R13B and TP53BP2. 
The latter protein interacts with TP53 to specifically en- 
hance p53-induced apoptosis but not cell cycle arrest. 

Four distinct regulatory nodes were observed in the 
ERBB2 group: around the ERBB2 itself, TP53, NFKB1 
and CTNNB1 (cadherin-associated protein, beta 1) 
(Additional file 6: Figure S5). NFkB-p65 was shown to 
repress (3-catenin-activated transcription of cyclin Dl 
[17]. Moreover, a direct interaction is established be- 
tween ERBB2 and GRB7 (Additional file 3: Figure S2). 
The solution structure of the Grb7-SH2/erbB2 peptide 
complex was described and suggested to be involved in 
cell signaling pathways that promote the formation of 
metastases and inflammatory responses. PPARBP } which 
is co-amplified with ERBB2, has in early studies been 
suggested to play a role in mammary epithelial differenti- 
ation and in breast carcinogenesis by its ability to func- 
tion as ESR1 coactivator. It was shown to contain a 
typical CCAT box and multiple cis-elements such as CI 
EBPbeta, YY1, c-ETS-1, API, AP2, and NFkappaB bind- 
ing sites. The 4 different regulatory nodes are connected 
by FLOT2, the human epidermal surface antigen 
involved in epidermal cell adhesion. NFKB1 was present 
in the network for the Basal group, where also the FOX 
family, a whole family of cyclins and CDK2, and CDK6 
and isoforms of protein kinase (RPS6K) were present 
(Additional file 7: Figure S6). Interestingly, a large num- 
ber of connections lead to GJA1 (Cap junction protein, 
alpha, also known as connexin 43). Other distinct nodes 
around TP53 are those connecting to KRT5, MAPK sig- 
nalling, E2F1 and NCL. NCL, Nucleolin, one of the most 
abundant nucleolar proteins, has been recently shown to 
be involved in the reprogramming of somatic cells for 
derivation of either embryonic stem (ES) cells, by som- 
atic cell nuclear transfer (SCNT), or ES-like cells, by 
induced pluripotent stem (iPS) cell procedure. Nucleolar 
proteins are proposed to be the markers of activation of 
embryonic genes [18] and provide mechanism for nucle- 
olar control of progression of cell cycle in stem cells and 
cancer cells [19]. TP53 was a central node in the regula- 
tory network of the normal-like subgroup, surrounded 
by JUN, ACSS2, ACSL1, KRT13, PIK3R1 and other 
nodes some representing glycolysis, energy metabolism, 
pyruvate metabolism and metabolism of carbohydrate 
(Additional file 8: Figure S7). 

Noteworthy, a TP53 network node was observed in 
each of the studied expression subclasses shown here 
(Additional file 4: Figure S3, Additional file 8: Figures 
S7). It is of interest to note that in every case TP53 was a 
hub in a somewhat different neighborhood. While in the 
basal subtype TP53 was connected to CDK6, a cyclin- 
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dependent protein kinase (CDKs) that regulate major cell 
cycle transitions and CDH3, cadherin 3, as well as FZD7 
and KRT5, in the luminal A tumors one could observe 
detoxifying enzymes such as NAT1, CYP2A6 as well as 
the retinoic acid receptor RARRES3 in the TP53 hub 
(Figure 1). 

Over-representation of specific transcription factor 
binding sites in the promoter of the genes that 
distinguish the subtypes 

The correlation matrix of TFBS fold-overrepresentation 
vectors for the five subtypes shows positive correlation 
in terms of potential TFBS family overrepresentation be- 
tween 1. ERBB2+ and basal subtypes (0.27); 2. Luminal B 
and ERBB2+ (0.16); 3. Luminal A and luminal B (0.11). 
In order to visualize the differential TFBS overrepresen- 
tation, we performed the principal component analysis 
(PCA). PCA plot (Figure 2) displays the significant differ- 
ences between the subtypes in terms of fold-factor of 
motif frequencies observed in promoter sequences of 
subtype-associated gene promoters compared to their 
corresponding normal frequencies in genomic promoter 



sequences. Distances between points representing the 
TFBS matrix families are the multivariate distances of 
fold-factor overrepresentation of each TFBS family in 
each of the subtype. This indicates that the shorter the 
distance, the greater similarity in fold-overrepresentation 
of that particular TFBS family in given subtypes. More 
than 60% and 76% of cumulative variance is captured by 
first two components and first three principal compo- 
nents, respectively. The top ten ranking TFBS families in 
distance from center and some of the functionally signifi- 
cant TFBS families are specifically labeled in the PCA 
plot. Biplots of first and second principal components 
show differentially overrepresentated TFBS families be- 
tween the normal-like and rest of the subtypes. Biplot of 
second and third principal components shows TFBS fam- 
ily overrepresentations in luminal B. Differential TFBS 
family representations between ERBB2+ and basal groups 
cannot be seen in biplots of first three principal compo- 
nents, but can be visualized in a biplot of first and fourth 
principal components. In the first principal component, 
V$BTBF, V$PAX1, V$PAX4 and V$TCFF are the major 
contributors of variance, where as V$PAX4, V$GUCE, 



SERPINH1 



PAX8 



TFBS match in 
target promoter 



AQP3 




CREB1 

Figure 1 Predicted functional relationship of TP53 in different molecular subtypes of breast cancer. Figure shows predicted interactions of 
genes or proteins with TP53. Source: Bibliosphere pathway database, {green edges: TF motif match found in target promoter of target genes; genes 
associated with basal subtype are shown as red nodes, ones with luminal A in blue, luminal B in cyan and normal-like as green nodes.) 
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Component 2 Component 1 

Figure 2 PCA plot of overrepresentated TFBS matrix families. PCA plot shown in terms of fold factor of overrepresentation in each subtype 
compared to the reference genomic promoters background. Blue lines represent the Eigen vectors, the direction and length of which indicates 
how each subtype variable contributes to the principal components in the plot. TFBS matrix families with maximum distance from the centroid 
are labeled on the plot. 

v.. J 



V$ARID are the major contributors of variance in the sec- 
ond principal component. 

Several of the gene clusters shared ds-elements that 
were present in more than 90% of the promoters. For 
the top six genes that classify the ErbB2+ over-expres- 
sing cluster, four TFBSs were found to be present in 
100% of the promoters. These were NOLF (Neuron-spe- 
cific-olfactory), ETSF (E26 Transformation-Specific fac- 
tor 1), STAT (the Signal Transducers and Activator of 
Transcription protein) and NF-kB (Nuclear Factor Kappa 
Beta) (Additional file 9: Table S2). NF-kB is the family of 
nuclear factor kappa beta of transcription factors. NF-kB 
has been shown to promote cell proliferation, to sup- 
press apoptosis, to promote cell migration, and suppress 
differentiation [7]. NF-kB binding sites were found 



significantly over-represented in the promoters that best 
classify the ErbB2 + subgroup compared to the other 4 
subgroups (Additional file 9: Table S2; Figure 3B) and 
78% of the 27 genes expressed in the basal-like subgroup 
had also NF-kB binding site in the promoter. This was in 
marked contrast compared to the promoter composition 
of the normal-like and luminal subgroups (Figure 3B). 
The presence of NF-kB binding sites in the genes from 
the ERBB2 and basal groups is in concordance with the 
pathway analysis performed on the downstream genes 
(see above). The ds-elements PAX1, PAX9 (The paired 
box gene 5), MAZF (myoassociated zinc finger) and 
EGRF (epidermal growth factor receptor) were overre- 
presented in the genes that are over-expressed in the Lu- 
minal B subgroup (Additional file 9: Table S2). While the 



A 




Luminal A 



Luminal B 



ErbB2 + 



Basal Normal 
like 



61% 



53% 



100% 



78% 



65% 



g Frequency of 

NF-KB TFBSs p=0.012 

^ Frequency of 

TP53 mutations p<0.000 16% 71% 86% 75% 66% 

Figure 3 Subtypes with relevance to NF-kB binding sites and TP53 mutations. A. The five subtypes shown by hierarchical clustering using 
the "intrinsic" gene set. Dendrogram shows the clustering of the tumors into five subgroups. Branches are color-coded. B. Frequency of NF-kB 
binding sites in the 5 subgroups; and C. Frequency of TP53 mutations. 
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PAX superfamily is involved in a multitude of develop- 
mental processes and is required for initiating B cell 
lineage and maintaining neural development and sperm- 
atogenesis, the MAZF is a common transcription factor 
and might play a more general role. The major distinc- 
tion between the luminal A and B, both consisting of ER 
positive tumors, is the presence of a strong proliferations 
cluster in the luminal B subtype. Noteworthy, binding 
sites for growth factors and their receptors like EGRF are 
over-represented in the promoters of the genes that de- 
fine the luminal B subgroup and were overrepresented in 
the pathway analysis as well (see above). EGRF is not 
only a receptor for EGF (Epidermal growth-factor), but 
also for other members of the EGF family and it is 
involved in the control of cell growth and differentiation. 
For the geneset of the normal-like subgroup, we 
observed overrepresentation of NRF1 family of TFBS 
(Additional file 9: Table S2). 

Presence of promoter modules in genes that define the 
ErbB2+ subgroup 

The specificity of promoter-controlled gene regulation 
may depend on the relative organization of the elements 
within the promoter rather than solely on individual ele- 
ments [20-22]. Genes expressed in the same functional 
context do often share promoter modules [20,21]. The 
binding elements are often occupied differently in 



different tissues, and these differences can be used to de- 
rive all type-specific sub-modules in silico. A promoter 
module may be defined as an organized group of regula- 
tory elements where both order and distance should be 
considered. Genes expressed in the same functional con- 
text do often share promoter modules [20,21]. For the 
six best genes of the ErbB2+ over-expressing cluster, a 
common framework consisting of NF-kB and ETS1 tran- 
scription factor binding sites was found (Figure 4). The 
ETS are fundamentally important TFs with roles in cell 
development, cell differentiation, cell proliferation, apop- 
tosis and tissue remodeling (reviewed [23]). The family is 
characterized by an evolutionarily conserved DNA-bind- 
ing domain that regulates expression by binding to a 
purine-rich core sequence in cooperation with other 
TFs. Most of the proteins in the ETS family are down- 
stream nuclear targets of ras-MAP kinase signaling, and 
the deregulation of ETS genes results in the malignant 
transformation of cells [24] It has previously been 
reported that mutant TP53 required ETS1 to synergistic- 
ally activate the expression of ABCB1. ETS1 was shown 
to interact exclusively with mutant TP53 in vivo, but not 
with wild-type TP53 [25]. High levels of ETS1 expression 
were associated with poorer prognosis [26]. The pres- 
ence of a promoter module constituting of NF-kB and 
ETS has been reported previously in genes co-regulated 
in mitogen-stimulated T-cells [27]. Interactions between 
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Figure 4 Common Framework in the ErbB2 + subgroup. The common framework consisting of NF-kB and Ets found in the 6 cluster defining 
genes of the ErbB2+ over-expressing subgroup. Distance to next element is between 29 and 79 bp (ETSF). Directions (up/down) of the elements 
indicate presence of hits on sense or antisense strands respectively. 
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Figure 5 Heatmap of fold overrepresentations for the TFBS families ranked according to their distance from center. Out of all, 25 TFBS 
matrix families top ranked according to their distance from the centroid of fold-overrepresentation matrix, are highlighted. 



members of the ETS family and NF-kB have been 
described previously. ETS1 induces IKKa expression. 
IKKa is a kinase that marks the NF-kB inhibitor IkB for 
degradation, and active NF-kB is translocated to the nu- 
cleus. ETS 1 -mediated activation of IKKa is negatively 
regulated by TP53 binding to ETS1. TP53 physically 
interacts with ETS1 and specifically inhibits ETS1 
induced IKKa promoter activity. Loss of TP53-mediated 
control over ETS1 dependent transactivation of IKKa 
may represent a novel pathway for the constitutive acti- 
vation of NF-kB mediated gene expression and therapy 
resistance in cancer cells [28] TP53 is therefore an ETS1 
and ETS2 target gene [29]. NF-kB controls a broad 



spectrum of genes by a variety of mechanisms in re- 
sponse to diverse environmental changes. NF-kB may be 
a universal regulator, while ETS could reflect cell-type or 
stimulation specific differences since ETS binding sites 
were detected in a fraction of the NF-kB controlled 
genes. 

Over-representation of TP53 mutations in the tumors that 
belong to the ErbB2 + and basal-like subgroups 

In human breast tumors, the two tumor subgroups exhi- 
biting the most prominent activation of putative NF-kB 
target genes (ErbB2 + and Basal-like) also harbored the 
highest frequency of p53 mutations. 86% of the patients 
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Figure 6 Correlation between TF overrepresentation and corresponding TF gene in each subtype. Correlation between geometric mean 
of TF gene expressions in each subtype (shown as red bars) and their corresponding TF matrix family overrepresented in subtype-specific 
promoter sequences (shown as blue points) is plotted. 



in the ErbB2 + subgroup had TP53 mutations in their 
tumors and all the genes that are abnormally expressed 
in this tumor type have NF-kB binding sites in their pro- 
moter (Figure 3C). There is an evidence that NF-kB can 
regulate TP53 expression and that NF-kB is required for 
TP53-dependent cell death [30]. In turn, TP53 activates 
NF-kB through the RAF/MEKl/p90 pathway [30]. The 
TP53 protein interacts with NF-kB and enhances its 
transcriptional activity and its anti-apoptotic efficacy. 
Over-expression of ErbB2 is known to induce the clas- 
sical NF-kB pathway [31,32]. The estrogen receptor (ER) 
can bind physically to NF-kB to inhibit its DNA binding 
functions, hitherto repressing gene expression [33]. 
Therefore the NF-kB pathway was shown to be a major 
stroma-tumor signaling mediator in ER negative tumors 
with over-expression of ErbB2 [8]. NF-kB signaling has 
been associated with doxorubicin resistance, and agents 
blocking NF-kB function have been proven beneficial in 
the treatment of tumors in combination with standard 
anti-cancer therapies [34]. 

Over-represented transcription factor families within the 
promoter sequences 

We observed the over-representation of V$BTBF (kaiso), 
V$OAZF and V$PAX8 in basal and ERBB2+ tumor asso- 
ciated gene promoters (Figure 5, Additional file 10: Table 
S3). Kaiso group of transcription factors are known to 
show nuclear accumulation during active mitosis [35] 
and their over-representation indicates potential func- 
tional role in these two subtypes showing aggressive 
tumor progression and high cell proliferation. PAX8 ac- 
tivity has also been observed in metastatic renal tumors 
[36]. Precise role of PAX8 and OAZF groups of tran- 
scription factors is yet unknown in breast cancers. 
ERBB2+ gene promoters also show over- representation 
of V$NFKB, Pleomorphic adenoma gene associated 
V$PLAG and ras-responsive element binding protein 
associated V$RREB families of TFBS. Activity of 
NFKappa B is already discussed in the earlier section. 
RREB1 activity plays a role in TP53 mediated apoptosis 
[37] that gets perturbed in absence of functional TP53, 



which is a common phenomenon in ERBB2+ tumors. 
Both luminal groups involve over-representation of PAX 
subgroup 1 member TFBSs- V$PAX1, V$PAX9 and 
V$ZF5F families. PAX9 activity is known to be a marker 
of better prognosis. Overrepresentation of V$P53F, 
V$HOXF, V$CLOX, V$PARF and V$GATA was 
observed specifically in luminal A group in which estro- 
gen receptor signaling is a predominant characteristic. 
The transcription factors corresponding to V$PARF 
group (PAR bZIP TFs) are mediators in oxidative stress- 
induced apoptosis [38]. In the luminal B group of pro- 
moters, we observed over-representation of V$EGRF, V 
$CTCF and V$EKLF etc. Egr-1 which corresponds to the 
V$EGRF family is known to be associated with cell cycle 
entry in response to growth stimuli [39]. We also 
observed significant over-representation of V$NRF1 in 
both normal-like and luminal B group of promoters. 
NRF-1 transcription factor is an oxidant-sensitive tran- 
scription factor, usually found in ER positive breast can- 
cers [40] and is shown to be associated with higher 
tumor grade [41]. 

By using the Wilcoxon rank sum test, we observed sig- 
nificantly elevated mRNA expressions of ESR1 and PGR 
in Luminal A or Luminal samples compared to the basal 
ones (p < 1.0e-6), with non-significant differences in 
ERBB2 expressions. As expected ERBB2 was significantly 
upregulated in ERBB2+ tumors along with downregu- 
lated ESR1 and PGR, compared to the rest (p < 1.0e-4). 
Regulation by many transcription factors shown overre- 
presented here in ER + ve or ER-ve subtypes is not well 
characterized in context of estrogen and progesterone re- 
ceptor activity. However, overrepresentation of some of 
the TFBS, such as GATA, BTBF, NF Kappa B - appear 
to be consistent with prevailing knowledge about the 
subtypes and their ER/PR or Her2 status. 

Thus functions of the TF genes corresponding to the 
over-represented TFBS families hint the predominant 
characteristics of the subtypes. Findings from the above 
in silico analysis will be further validated in reporter 
studies and ChIP analyses. The approach of identifying 
overrepresented TFBS in a set of coordinately expressed 
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genes under a particular disease class or condition can 
improve the specificity and noise tolerance [42]. How- 
ever, its main limitation is that it does not account for 
the role of local chromatin environment constituted by 
structural properties, epigenetic modification etc. The 
local chromatin environment can offer condition-spe- 
cific functionality to the existing TFBSs in a set of 
promoters. 

Promoter sequences extending from 500 bp upstream to 
100 bp downstream relative to TSS typically contain core 
promoter elements, CpG islands, downstream promoter 
element and other components of transcriptional machin- 
ery. Besides, this region has been demonstrated to have high 
density of positional as well as comparative TFBS [43], 
many of which are typically location sensitive. Thus limiting 
the analysis to this proximal promoter region, rather than 
analyzing the broader region (i.e. -1000 bp to +500 bp rela- 
tive to the TSS) - could reduce false positives in TFBS over- 
representation. However, by that very limitation we may 
omit important information about second alternative pro- 
moters and distant control loci, which are therefore outside 
the scope of this analysis. 

Correlation between actual abundance of TFs and 
frequency of their BS in the genes defining the clusters 

Some of the TFBS family overrepresentations were posi- 
tively correlated with the geometric means of subtype-spe- 
cific mRNA expressions of their corresponding TF genes. 
(Shown in Figure 6, Additional file 11: Table S4). The ra- 
tionale underlying the use of geometric mean is that gene 
expression intensity values follow lognormal distribution. 

Biological uncertainty in a correlation between the 
abundance of TFs and frequency of their BS might be 
attributed to several factors. The most common and ob- 
vious reason could be mutant or copy number altered 
TF. Moreover, here we have not accounted for the 
expressions of downstream targets of the TFs. It is note- 
worthy that mutations (point mutation and copy number 
alteration) in TFs can also have an impact on the level of 
expression of the downstream genes. For instance, a mu- 
tant TP53, which is still highly expressed, may not 
recognize the original binding sites anymore, leading to a 
drop in the expression of the target genes. 

Conclusion 

Here we report that the promoter composition of the 
genes that strongly predict the patient subgroups is dis- 
tinct. The gene classes showed a clear separation when 
based solely on their promoter composition. This finding 
suggests that studying those transcription factors asso- 
ciated to the observed expression pattern in breast can- 
cers may lead us to important biological pathways 
responsible for the regulation of gene expression in 
breast cancer. 



Additional files 



Additional file 1: Table SI. Subtype-specific gene list. Table shows 
the 197 subtype-specific best discriminatory genes, which is a subset of 
the intrinsic gene-list. 

Additional file 2: Figure SI. Histogram of z-scores of 
overrepresentation. Histogram of TFBS matrix family overrepresentation 
observed in subtype-specific promoters compared to the reference 
genomic promoter background shown as z-scores. 

Additional file 3: Figure S2. Direct interactions between genes 
defining subtypes. Subtype-relevant key driver interactions for Luminal 
A, B and ERBB2+ subtypes. 

Additional file 4: Figure S3. Protein-protein interactions and TF 
interactions associated with Luminal A subtype. Network shown here 
is based on the luminal A specific genelist. 

Additional file 5: Figure S4. Protein-protein interactions and TF 
interactions associated with Luminal B subtype. Network shown here 
is based on the luminal B specific genelist. 

Additional file 6: Figure S5. Protein-protein interactions and TF 
interactions associated with ERBB2+ subtype. Network shown here is 
based on the ERBB2+ subtype-specific genelist. 

Additional file 7: Figure S6. Protein-protein interactions and TF 
interactions associated with basal subtype. Network shown here is 
based on the basal subtype-specific genelist. 

Additional file 8: Figure S7. Protein-protein interactions and TF 
interactions associated with normal-like subtype. Network shown 
here is based on the normal-like subtype-specific genelist. 

Additional file 9: Table S2. TFBS overrepresentation in subtypes- 
specific gene promoters. List of significantly over-represented 
transcription factor binding site families in subtypes of breast cancers at 
the cut-off level of z- score > =2.0. 

Additional file 10: Table S3. Over-representation of potential TFBS 
in subtype-specific promoter sequences. Table shows the fold over 
representation of potential transcriptional factor hits (represented as TFBS 
families) in subtype- specific gene promoter sequences. 

Additional file 11: Table S4. Correlation between TFBS 
overrepresentation and mRNA expression of corresponding TF 
genes. Table displays the Pearson's correlation between the geometric 
mean of expression values of transcription factor genes in subtypes and 
fold overrepresentation of corresponding TFBS families. 
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